Part I - Ford GoBike System Dataset Analysis

by Ajani Ayooluwa

Introduction

In this project, I work with the Ford GoBike System dataset. It records the rides made in a bike-sharing system covering the greater San Francisco Bay Area during February 2019. Its features include the duration of each trip, the times at which the trip started and ended, the name of the station where the rider set off, and much more.

Preliminary Wrangling

Importing The Dataset

Visual Assessment of the given dataset.

Programmatic Assessment of the given dataset.

A concise summary of the data.

Quality Issue: Incorrect datatype. The start_time and end_time columns have the string datatype (i.e. object) instead of the datetime datatype.
Quality Issue: Incorrect datatype. The member_birth_year, end_station_id and start_station_id columns have the float datatype instead of the integer datatype.

Checking the percentage of missing values in each field in the data.

Quality Issue: The member_birth_year, member_gender, start_station_id, start_station_name, end_station_id and end_station_name columns have missing values, and the rows containing those missing values will have to be dropped.
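
The percentage of missing values per column can be computed along the following lines. This is a minimal sketch on a small made-up frame: the name rides and the toy values are assumptions for illustration and are not the actual data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the rides table (values are made up for illustration).
rides = pd.DataFrame({
    "duration_sec": [520, 610, 1800, 430],
    "start_station_id": [21.0, np.nan, 58.0, 3.0],
    "member_birth_year": [1984.0, np.nan, 1990.0, np.nan],
    "member_gender": ["Male", None, "Female", None],
})

# isna() marks missing cells; the column mean of those flags is the
# fraction missing, which is scaled here to a percentage.
missing_pct = (rides.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(2))
```

Columns that show a non-zero percentage are the ones that need attention during cleaning.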

Checking the number of unique values in each column

There are only 329 stations in the dataset.

Checking for duplicated rows

There are no duplicated rows in the dataset.
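
The two checks above (unique values per column and duplicated rows) could look roughly like this. The frame below is a hypothetical miniature with one deliberately repeated row, so the duplicate count differs from the real dataset, where none were found.

```python
import pandas as pd

# Toy frame: the third row repeats the first one on purpose.
rides = pd.DataFrame({
    "start_station_id": [21, 58, 21, 3],
    "end_station_id": [58, 3, 58, 21],
    "user_type": ["Subscriber", "Customer", "Subscriber", "Subscriber"],
})

# Number of distinct values in each column.
unique_counts = rides.nunique()

# duplicated() flags every row that repeats an earlier row in full.
n_duplicates = int(rides.duplicated().sum())

print(unique_counts)
print(f"Duplicated rows: {n_duplicates}")
```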

Checking the descriptive Statistics

The duration_sec column appears to have some outliers: the gap between the 75th percentile and the maximum (the 100th percentile) is very large and requires further investigation.
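
One way to put a number on that gap is to compare the quartiles with the maximum. The sketch below uses Tukey's 1.5 x IQR rule to flag extreme values; that rule is my own choice for illustration and is not necessarily the check used in this analysis. The durations are made up.

```python
import pandas as pd

# Made-up trip durations in seconds, with one extreme value.
durations = pd.Series([300, 420, 510, 640, 800, 950, 84000], name="duration_sec")

# Quartiles and interquartile range.
q1, q3 = durations.quantile([0.25, 0.75])
iqr = q3 - q1

# Tukey's rule: values above Q3 + 1.5 * IQR are treated as candidate outliers.
upper_fence = q3 + 1.5 * iqr
outliers = durations[durations > upper_fence]

print(f"75th percentile: {q3}, maximum: {durations.max()}")
print(f"Values above the upper fence ({upper_fence}): {outliers.tolist()}")
```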

Data Cleaning

Issues Identified

  1. Incorrect datatype: the start_time and end_time columns have the string datatype (i.e. object) instead of the datetime datatype.
  2. Missing values in the member_birth_year, member_gender, start_station_id, start_station_name, end_station_id and end_station_name columns.
  3. Incorrect datatype: the member_birth_year, end_station_id and start_station_id columns have the float datatype instead of the integer datatype.

Making a copy of the Ride Dataframe

Define
Incorrect datatype: the start_time and end_time columns have the string datatype (i.e. object) instead of the datetime datatype.

Code
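
A minimal sketch of this conversion, assuming the working copy of the dataframe is called rides_clean (a hypothetical name) and using two made-up timestamps:

```python
import pandas as pd

# Toy copy of the two timestamp columns, stored as strings (object dtype).
rides_clean = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10.1450", "2019-02-28 18:53:21.7890"],
    "end_time": ["2019-03-01 08:01:55.9750", "2019-03-01 06:42:03.0560"],
})

# Parse each column into pandas datetimes.
for col in ["start_time", "end_time"]:
    rides_clean[col] = pd.to_datetime(rides_clean[col])

print(rides_clean.dtypes)
```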

Test

Define
Missing Values in the member_birth_year, member_gender, start_station_id, start_station_name, end_station_id and end_station_name columns.

Code
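
Dropping the affected rows could be sketched as follows. The frame name rides_clean and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: only the first row is complete.
rides_clean = pd.DataFrame({
    "start_station_id": [21.0, np.nan, 58.0],
    "start_station_name": ["A", None, "C"],
    "end_station_id": [58.0, np.nan, 3.0],
    "end_station_name": ["C", None, "D"],
    "member_birth_year": [1984.0, 1990.0, np.nan],
    "member_gender": ["Male", "Female", None],
})

cols_with_gaps = [
    "member_birth_year", "member_gender",
    "start_station_id", "start_station_name",
    "end_station_id", "end_station_name",
]

# Drop every row that is missing a value in any of the listed columns.
rides_clean = rides_clean.dropna(subset=cols_with_gaps).reset_index(drop=True)

print(rides_clean.isna().sum())
```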

Test

Define
Incorrect datatype: the member_birth_year, end_station_id and start_station_id columns have the float datatype instead of the integer datatype.

Code
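
A sketch of the cast, again on a hypothetical rides_clean frame with made-up values. It has to run after the rows with missing values are dropped, because NaN cannot be cast to a plain integer dtype.

```python
import pandas as pd

# Toy frame with whole numbers stored as floats and no missing values.
rides_clean = pd.DataFrame({
    "member_birth_year": [1984.0, 1990.0],
    "start_station_id": [21.0, 58.0],
    "end_station_id": [58.0, 3.0],
})

int_cols = ["member_birth_year", "start_station_id", "end_station_id"]

# Cast the listed columns to 64-bit integers in one call.
rides_clean = rides_clean.astype({col: "int64" for col in int_cols})

print(rides_clean.dtypes)
```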

Test

The structure of the dataset
There were initially 183,412 rides in the dataset with 16 features; after data cleaning, 174,952 rides remained. The dataset has 5 categorical columns, 2 datetime columns and 9 numerical columns, for a total of 16 columns.

The main feature of interest in the dataset
I am most interested in what makes trips longer or shorter.

The features in the dataset that support my investigation into the ride duration feature
I believe that the user type, gender, member birth year and bike share variables will provide much of the information needed for my investigation into trip duration.

EXPLORATORY DATA ANALYSIS

Univariate Exploration

What is the average trip duration?

Observations

From the graph above, it is clear that most of the values in the ride duration column fall between 0 and 7000 seconds and that the distribution is right-skewed. There are no nulls in the column, but it appears to have some outliers which should be investigated further. Let’s start by checking the cases where rides have a duration of over 7000 seconds.

There are 557 cases where the ride duration is above 7000 seconds; this might be due to interstate trips. We need to zoom in on the cases below 7000 seconds to get a better grasp of the data.

The skewness is easier to see now. The values are still too widely spread, so an axis transformation is needed to get a full picture of the data.

Observation

On a logarithmically scaled x-axis, the ride durations do not appear normally distributed; most trips last between 100 and 2000 seconds, with a peak at about 500 seconds.
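
The logarithmic scaling can be illustrated with bin edges that are evenly spaced in log10 space. The sketch below uses synthetic durations drawn at random (not the real data) and fixed edges from 10 to 100,000 seconds; both are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic durations in seconds, centred near 500 s (illustration only).
rng = np.random.default_rng(0)
durations = rng.lognormal(mean=np.log(500), sigma=0.7, size=5000)

# 40 bins whose edges are evenly spaced in log10 space: 10^1 ... 10^5.
log_edges = np.logspace(1, 5, num=41)
counts, _ = np.histogram(durations, bins=log_edges)

# Geometric midpoint of the fullest bin.
peak = counts.argmax()
peak_center = np.sqrt(log_edges[peak] * log_edges[peak + 1])
print(f"Modal bin is centred near {peak_center:.0f} seconds")

# For a plot, the same edges can be passed to matplotlib's hist(bins=log_edges)
# together with a logarithmic x-axis.
```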

Which gender took the highest number of trips?

Observation

A higher percentage of trips was taken by male riders than by any other gender.
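
The percentage of trips per gender can be obtained from normalised value counts. The series below is made up, so its percentages differ from those in the real data.

```python
import pandas as pd

# Made-up gender labels, one per trip.
genders = pd.Series(
    ["Male", "Male", "Male", "Female", "Other", "Male", "Female", "Male"],
    name="member_gender",
)

# normalize=True returns proportions; multiply by 100 for percentages.
gender_share = genders.value_counts(normalize=True) * 100
print(gender_share)
```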

Do more riders generally share bikes for the trips embarked on?

Observation

There is a huge difference between the number of riders who share the bikes with others and those that do not. It is clear to see that riders generally do not share the bikes with others during their trips.

What is the name of the station where the most people got on?

Observation

The most riders got on at the Market Station at 10th Street; this station is most likely close to a densely populated residential area.

What is the name of the end station where the most people got off?

Observation

The station where the most people got off is San Francisco Caltrain Station 2 (Townsend St at 4th St), which is also the station where the second highest number of people got on.

Does age affect the number of rides embarked on?

Observation

From the graph above, it is clear that the active members are between 18 and 50 years old. The graph is left skewed. The column appears to have some outliers which should be investigated further. Let’s start by investigating the cases where members are 100 years or older.

There are only 72 instances where the members are 100 years or older. The dataset was gathered in February 2019, and it is unlikely that individuals of that age actually went on bike trips, given how much agility declines with age. Ages of 100 or more are treated as outliers and dropped, as they are probably inaccurate and might affect the rest of the analysis.
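
A sketch of how the age filter could be applied. The column name age, the frame name rides_clean and the birth years are assumptions for illustration; age is taken as 2019 minus the birth year.

```python
import pandas as pd

# Made-up birth years, two of them implausibly early.
rides_clean = pd.DataFrame({"member_birth_year": [1984, 1990, 1900, 1878, 2000]})

# Age at the time the data was collected (2019).
rides_clean["age"] = 2019 - rides_clean["member_birth_year"]

# Count, then drop, riders recorded as 100 years or older.
n_centenarians = int((rides_clean["age"] >= 100).sum())
rides_clean = rides_clean[rides_clean["age"] < 100].reset_index(drop=True)

print(f"Dropped {n_centenarians} rows; oldest remaining rider is {rides_clean['age'].max()}")
```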

Which user type takes the most trips?

Observation

There is a huge difference between the number of riders who are subscribers and those who are not. Riders are clearly mostly subscribers, and subscribers took the most trips.

What is the busiest day of the week?

Observation

Surprisingly, the day of the week with the highest number of trips is not Monday but Thursday. The weekend, as expected, has the fewest trips; this may be because most people spend the weekend resting at home.
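
Trips per weekday can be counted from the start timestamps roughly as follows. The timestamps are made up for illustration.

```python
import pandas as pd

# Made-up start times in February 2019.
start_times = pd.to_datetime(pd.Series([
    "2019-02-04 08:15:00",  # Monday
    "2019-02-07 08:30:00",  # Thursday
    "2019-02-07 17:45:00",  # Thursday
    "2019-02-09 11:00:00",  # Saturday
]))

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Count trips per weekday name and put the days in calendar order.
trips_per_day = (
    start_times.dt.day_name()
    .value_counts()
    .reindex(day_order, fill_value=0)
)
print(trips_per_day)
```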

What is the busiest hour of the day?

Observation

The busiest hours in the morning are 8am and 9am, while in the evening the busiest hours are 5pm and 6pm.
This trend may be due to working patterns, with most people leaving for work in the morning and returning in the evening.
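
The hourly counts behind this observation could be derived as in the sketch below, which uses made-up timestamps.

```python
import pandas as pd

# Made-up start times on a single day.
start_times = pd.to_datetime(pd.Series([
    "2019-02-05 08:05:00",
    "2019-02-05 08:50:00",
    "2019-02-05 09:10:00",
    "2019-02-05 13:00:00",
    "2019-02-05 17:30:00",
    "2019-02-05 17:45:00",
]))

# Number of trips starting in each hour of the day (0-23).
trips_per_hour = start_times.dt.hour.value_counts().sort_index()

# The two hours with the most trips.
busiest_hours = trips_per_hour.nlargest(2).index.tolist()
print(trips_per_hour)
print(f"Busiest hours: {busiest_hours}")
```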

Using the number of trips per hour, is the Market Station at 10th Street (i.e. the station at which the most people got on) located near a residential area?

Observation

From the graph above, two spikes can be observed. The first, around 8 to 9 in the morning, could imply that the station is located close to a business district, as most jobs start at 8 or 9 am. The second, around 5 to 6 pm, could imply that the station, although close to a business district, is also located near a residential area.
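
Restricting the hourly count to a single start station could be sketched like this. The station label "Market St at 10th St" is an assumption about how the name is spelled in the data, "Other Station" is a placeholder, and the rows are made up.

```python
import pandas as pd

# Made-up rides; the station labels are assumptions (see above).
rides = pd.DataFrame({
    "start_station_name": [
        "Market St at 10th St", "Market St at 10th St",
        "Market St at 10th St", "Other Station",
        "Market St at 10th St",
    ],
    "start_time": pd.to_datetime([
        "2019-02-05 08:10:00", "2019-02-05 08:40:00",
        "2019-02-05 17:20:00", "2019-02-05 08:05:00",
        "2019-02-06 17:55:00",
    ]),
})

station = "Market St at 10th St"

# Keep only rides that start at the chosen station, then count per hour.
at_station = rides[rides["start_station_name"] == station]
hourly = at_station["start_time"].dt.hour.value_counts().sort_index()
print(hourly)
```

The same pattern applies to an end station by filtering on end_station_name and counting the hours of end_time.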

Using the number of trips per hour, is the San Francisco Caltrain Station 2 located near a highly populated area?

Observation

It can be observed that many people get off at San Francisco Caltrain Station 2. There is a prominent spike between 4 and 6 in the afternoon, when more people arrive at the station, which could imply that the station is located close to a highly populated residential area.

When are most trips taken in terms of day of the month?

Observation

The weekends generally have fewer rides than the other days of the week, so a drop in the number of rides on those days is to be expected. However, there is also a large drop on the 13th of the month, which might be because it is the day before Valentine's Day. The day with the highest number of rides is the 28th, which is also a Thursday.

Bivariate Exploration

Do younger people prefer shorter trips?

Observation

Young people make the most trips, but the graph above shows that trip duration is inversely correlated with age: older people generally prefer shorter trips.
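
One way to check the relationship numerically is to average the trip duration within age bands. The band edges and the values below are assumptions chosen for illustration, as is the column name age.

```python
import pandas as pd

# Made-up ages and trip durations.
rides = pd.DataFrame({
    "age": [22, 27, 35, 38, 45, 55, 62],
    "duration_sec": [900, 700, 650, 610, 600, 500, 450],
})

# Assign each rider to an age band: (18, 30], (30, 40], ... (60, 100].
rides["age_group"] = pd.cut(rides["age"], bins=[18, 30, 40, 50, 60, 100])

# Mean duration per age band (observed=True skips empty bands).
mean_by_age = rides.groupby("age_group", observed=True)["duration_sec"].mean()
print(mean_by_age)
```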

Which gender takes longer trips?

Observation

The values are too widely spread in the visualization, so we need to zoom in on the key areas of interest.

Observation

Although there are more male riders, riders in the Other gender category clearly tend to take longer trips than male and female riders.

Which day of the week on average has the longest trips?

Observation

Weekend trips take longer than trips taken on other days of the week. Most visits to family happen at the weekend, which might explain this.

Which user type takes longer trips?

Observation

Although there are more subscribers than customers, trips by customers last longer than trips by subscribers.
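
The comparison can be summarised with a grouped aggregate. The sketch below reports the trip count and the median duration per user type; the median is my own choice here because it is less sensitive to the long-duration outliers, and the values are made up.

```python
import pandas as pd

# Made-up rides for the two user types.
rides = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber", "Customer", "Customer"],
    "duration_sec": [400, 520, 610, 900, 1500],
})

# Trip count and median duration for each user type.
summary = rides.groupby("user_type")["duration_sec"].agg(["count", "median"])
print(summary)
```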

Which user type shares more bikes for trips?

Observation

From the graph above, it seems that only subscribers have access to the bike share option.
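
A cross-tabulation shows the same split as a table. The column name bike_share_for_all_trip is an assumption about how the bike share variable is named in the data, and the rows are made up.

```python
import pandas as pd

# Made-up rides; the bike share column name is an assumption.
rides = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Customer", "Subscriber", "Customer"],
    "bike_share_for_all_trip": ["Yes", "No", "No", "No", "No"],
})

# Rows: user type; columns: bike share flag; cells: number of trips.
share_by_user = pd.crosstab(rides["user_type"], rides["bike_share_for_all_trip"])
print(share_by_user)
```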

Multivariate Exploration

Which user type has the longest trip duration by bike share?

Observation

As observed earlier, the customer user type does not seem to have the bike share feature. For subscribers, shared trips have longer durations on average than trips that are not shared.

Which user type has the longest trip duration by gender?

Observation

In the graph above, it can be observed that for both user types, female riders had longer ride durations than male riders, and riders in the Other gender category took the longest trips of all.
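
The comparison across two categorical variables can be laid out as a pivot table of mean durations. This is a sketch on made-up values, one ride per combination.

```python
import pandas as pd

# Made-up rides covering each user type and gender combination once.
rides = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber",
                  "Customer", "Customer", "Customer"],
    "member_gender": ["Male", "Female", "Other", "Male", "Female", "Other"],
    "duration_sec": [500, 600, 700, 1000, 1200, 1500],
})

# Mean duration with user types as rows and genders as columns.
mean_duration = rides.pivot_table(
    index="user_type",
    columns="member_gender",
    values="duration_sec",
    aggfunc="mean",
)
print(mean_duration)

# Gender with the longest mean trip for each user type.
longest_by_user = mean_duration.idxmax(axis=1)
```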

Which day of the week by user type has the longest trips?

Observation

The very short trip durations suggest that subscribers do not travel long distances. Sunday is the day of the week with the longest trips for both user types.

Conclusions

The dataset contains rides for February 2019. Trip durations peak at about 500 seconds, and above 74% of the trips were taken by male riders. The most trips were taken on Thursdays, while weekends have the fewest trips, possibly because most people prefer to rest at home at the weekend. Most trips were taken between 8 and 9 in the morning and between 5 and 6 in the evening, which could be a result of the commute to and from work. The active riders are between 18 and 50 years old. Average ride duration decreases as age rises above 50, which could imply that older riders prefer shorter trips because they have less strength or agility for longer rides. Above 90% of the trips were taken by subscribers. Less than 10% of the trips were shared, and only by subscribers, which could imply that only subscribers are allowed to share bikes on trips.

Saving the cleaned dataframe